Problem Description

Forecasting is an important approach to plan the future effectively and efficiently. A time series is a sequence of data points, typically consisting of successive measurements made over a uniform time interval. Time series forecasting is the use of a model to predict future values based on previously observed values.

A leading retailer in USA, wants to forecast sales for their product categories in their store based on the sales history of each category. Sales forecast has very high influence on the performance of the company’s business and hence these sales forecasts can be used to estimate company’s success or performance in the coming year. Accurate forecasts may lead to better decisions in business.

Sales or revenues forecasting is very important for retail operations . Forecasting of retail sales helps retailer to take necessary measures to plan their budgets or investments in a period (monthly, yearly) among different product categories like women clothing, men clothing and other clothing and at the same time they can plan to minimize revenue loss from unavailability of products by investing accordingly.

Libraries To Import

Loading Data into environment

trainingData <-  read.csv('train.csv', header = T)

Visualization of missing Data

Missing Data visualization will give us overall picture, whether data is sufficient or not to proceed for further analysis.

Splitting Data for 3 categories

Splitting data into 3 product categories, so that each time series can be analyzed.

dfMale <- trainingData[which(trainingData$ProductCategory == "MenClothing"),]
dfMale_REM <- subset(dfMale, select = -c(ProductCategory))
dfFemale <- trainingData[which(trainingData$ProductCategory == "WomenClothing"),]
dfFemale_REM <- subset(dfFemale, select = -c(ProductCategory))
dfOther <- trainingData[which(trainingData$ProductCategory == "OtherClothing"),]
dfOthers_REM <- subset(dfOther, select = -c(ProductCategory))
timeStamp<-as.yearmon(paste(dfOthers_REM$Year, dfOthers_REM$Month), "%Y %m")

Converting product categories sales data into TimeSeries

TS_dfMale_REM<- ts(dfMale_REM$Sales.In.ThousandDollars.,frequency = 12,start = c(2009,1))
TS_dfFemale_REM<- ts(dfFemale_REM$Sales.In.ThousandDollars.,frequency = 12,start = c(2009,1))
TS_dfOthers_REM<- ts(dfOthers_REM$Sales.In.ThousandDollars.,frequency = 12,start = c(2009,1))

Visualization of missing Data for each Product Categories

Men Clothing ~ 4 missing values

Women Clothing ~ 4 missing values

Other Clothing ~ 5 missing values

Imputing TimeSeries Missing Values

Time Series values can be filled in number of ways.

Some of known ways to fill time series:
1)Last Observation Carried Forward- LOCF
2)Next Observation Carried Backward - NOCB
3)Kalman Smoothing
4)Interpolation

Method opted to fill our sales time series is locf (Last Observed Come First). LOCF Description: Replaces each missing value with the most recent present value prior to it

SneekPeak of product sales for Males Category

## No trace type specified:
##   Based on info supplied, a 'scatter' trace seems appropriate.
##   Read more about this trace type -> https://plot.ly/r/reference/#scatter

SneekPeak of product sales for Females Category

## No trace type specified:
##   Based on info supplied, a 'scatter' trace seems appropriate.
##   Read more about this trace type -> https://plot.ly/r/reference/#scatter

SneekPeak of product sales for Others Category

## No trace type specified:
##   Based on info supplied, a 'scatter' trace seems appropriate.
##   Read more about this trace type -> https://plot.ly/r/reference/#scatter

Making Time Series Model for Male Category

Decomposing Product Category MenClothing for visualizing random, seasonal and trend components of TimeSeries

ACF and PACF plots

Split into train and test

Sales_train = subset(dfMale_REM, Year !="2015")
Sales_val = subset(dfMale_REM, Year =="2015")

Sales_traints = ts(Sales_train$Sales.In.ThousandDollars., frequency = 12,start=c(2009,1))
Sales_testts = ts(Sales_val$Sales.In.ThousandDollars., frequency = 12, start = c(2015,1))

Model Building

1) Holt Winters model

Sales_HW <-  HoltWinters(Sales_traints)

#Train Model MAPE
MAPE_train_HW <- mean(abs(Sales_traints-Sales_HW$fitted[,"xhat"])/abs(Sales_traints))*100

#Validation Forecast Mape
Sales_forecast_HW <- forecast:::forecast.HoltWinters(Sales_HW, h=12)
MAPE_test_HW <- mean(abs(Sales_testts - Sales_forecast_HW$mean)/abs(Sales_testts))*100

Visualizing Forecasted Values

2) Auto.Arima model

Sales_autoArima <- auto.arima(Sales_traints, ic='aic')

#Train Model MAPE
MAPE_train_AA <- mean(abs(Sales_traints-Sales_autoArima$x)/abs(Sales_traints))*100

#Validation Forecast Mape
Sales_forecast_AA <- forecast:::forecast.Arima(Sales_autoArima,h=12)
MAPE_test_AA <- mean(abs(Sales_testts-Sales_forecast_AA$mean)/abs(Sales_testts))*100

Visualizing Forecasted Values

3) ETS model

Sales_ETS <- ets(model = "ZMZ",Sales_traints)

#Train Model MAPE
MAPE_train_ETS <- mean(abs(Sales_traints-Sales_ETS$fitted)/abs(Sales_traints))*100

#Validation Forecast Mape
Sales_forecast_ETS <- forecast:::forecast.ets(Sales_ETS,h=12)
MAPE_test_ETS <- mean(abs(Sales_testts-Sales_forecast_ETS$mean)/abs(Sales_testts))*100

Visualizing Forecasted Values

4) thetaf model

Sales_taf <- thetaf(Sales_traints)

#Train Model MAPE
MAPE_train_taf <- mean(abs(Sales_traints-Sales_taf$fitted)/abs(Sales_traints))*100

#Validation Forecast Mape
Sales_forecast_taf <- forecast:::forecast.Arima(Sales_autoArima,h=12)
MAPE_test_taf <- mean(abs(Sales_testts-Sales_forecast_taf$mean)/abs(Sales_testts))*100

Visualizing Forecasted Values

5) xGBoost model

Sales_xg <- forecastxgb:::xgbts(Sales_traints)
## Warning in forecastxgb:::xgbts(Sales_traints): xgbts is deprecated terminology and will soon be removed.
## Please use xgbar instead.
#Train Model MAPE
MAPE_train_xG <- mean(abs(Sales_traints-Sales_xg$y)/abs(Sales_traints))*100

#Validation Forecast Mape
Sales_forecast_xG <- forecast:::forecast(Sales_xg,h=12)
MAPE_test_xG <- mean(abs(Sales_testts-Sales_forecast_xG$mean)/abs(Sales_testts))*100

Visualizing Forecasted Values

Making Time Series Model for Female Category

Decomposing Product Category WomenClothing for visualizing random, seasonal and trend components of TimeSeries

ACF and PACF plots

Split into train and test

FE_Sales_train = subset(dfFemale_REM, Year !="2015")
FE_Sales_test = subset(dfFemale_REM, Year =="2015")

FE_Sales_traints = ts(FE_Sales_train$Sales.In.ThousandDollars., frequency = 12,start=c(2009,1))
FE_Sales_testts = ts(FE_Sales_test$Sales.In.ThousandDollars., frequency = 12, start = c(2015,1))

Model Building

1) Holt Winters model

FE_Sales_HW <-  HoltWinters(FE_Sales_traints)

#Train Model MAPE
FE_MAPE_train_HW <- mean(abs(FE_Sales_traints-FE_Sales_HW$fitted[,"xhat"])/abs(FE_Sales_traints))*100

#Validation Forecast Mape
FE_Sales_forecast_HW <- forecast:::forecast.HoltWinters(FE_Sales_HW, h=12)
FE_MAPE_test_HW <- mean(abs(FE_Sales_testts - FE_Sales_forecast_HW$mean)/abs(FE_Sales_testts))*100

Visualizing Forecasted Values

2) Auto.Arima model

FE_Sales_autoArima = auto.arima(FE_Sales_traints, ic='aic')

#Train Model MAPE
FE_MAPE_train_AA <- mean(abs(FE_Sales_traints-FE_Sales_autoArima$x)/abs(FE_Sales_traints))*100

#Validation Forecast Mape
FE_Sales_forecast_AA <- forecast:::forecast.Arima(FE_Sales_autoArima,h=12)
FE_MAPE_test_AA <- mean(abs(FE_Sales_testts-FE_Sales_forecast_AA$mean)/abs(FE_Sales_testts))*100

Visualizing Forecasted Values

3) ETS model

FE_Sales_ETS <- ets(model = "ZMZ",FE_Sales_traints)

#Train Model MAPE
FE_MAPE_train_ETS <- mean(abs(FE_Sales_traints-FE_Sales_ETS$fitted)/abs(FE_Sales_traints))*100

#Validation Forecast Mape
FE_Sales_forecast_ETS <- forecast:::forecast.ets(FE_Sales_ETS,h=12)
FE_MAPE_test_ETS <- mean(abs(FE_Sales_testts-FE_Sales_forecast_ETS$mean)/abs(FE_Sales_testts))*100

Visualizing Forecasted Values

4) thetaf model

FE_Sales_taf <- thetaf(FE_Sales_traints)

#Train Model MAPE
FE_MAPE_train_taf <- mean(abs(FE_Sales_traints-FE_Sales_taf$fitted)/abs(FE_Sales_traints))*100

#Validation Forecast Mape
FE_Sales_forecast_taf <- forecast:::forecast.Arima(FE_Sales_autoArima,h=12)
FE_MAPE_test_taf <- mean(abs(FE_Sales_testts-FE_Sales_forecast_taf$mean)/abs(FE_Sales_testts))*100

Visualizing Forecasted Values

5) xGBoost model

FE_Sales_xg <- forecastxgb:::xgbts(FE_Sales_traints)
## Warning in forecastxgb:::xgbts(FE_Sales_traints): xgbts is deprecated terminology and will soon be removed.
## Please use xgbar instead.
#Train Model MAPE
FE_MAPE_train_xG <- mean(abs(FE_Sales_traints-FE_Sales_xg$y)/abs(FE_Sales_traints))*100

#Validation Forecast Mape
FE_Sales_forecast_xG <- forecast:::forecast(FE_Sales_xg,h=12)
FE_MAPE_test_xG <- mean(abs(FE_Sales_testts-FE_Sales_forecast_xG$mean)/abs(FE_Sales_testts))*100

Visualizing Forecasted Values

Making Time Series Model for Other Category

Decomposing Product Category Other Clothing for visualizing random, seasonal and trend components of TimeSeries

ACF and PACF plots

Split into train and test

OT_Sales_train = subset(dfOthers_REM, Year !="2015")
OT_Sales_test = subset(dfOthers_REM, Year =="2015")

OT_Sales_traints = ts(OT_Sales_train$Sales.In.ThousandDollars., frequency = 12,start=c(2009,1))
OT_Sales_testts = ts(OT_Sales_test$Sales.In.ThousandDollars., frequency = 12, start = c(2015,1))

Model Building

1) Holt Winters model

OT_Sales_HW <-  HoltWinters(OT_Sales_traints)

#Train Model MAPE
OT_MAPE_train_HW <- mean(abs(OT_Sales_traints-OT_Sales_HW$fitted[,"xhat"])/abs(OT_Sales_traints))*100

#Validation Forecast Mape
OT_Sales_forecast_HW <- forecast:::forecast.HoltWinters(OT_Sales_HW, h=12)
OT_MAPE_test_HW <- mean(abs(OT_Sales_testts - OT_Sales_forecast_HW$mean)/abs(OT_Sales_testts))*100

Visualizing Forecasted Values

2) Auto.Arima model

OT_Sales_autoArima <- auto.arima(OT_Sales_traints, ic='aic')

#Train Model MAPE
OT_MAPE_train_AA <- mean(abs(OT_Sales_traints-OT_Sales_autoArima$x)/abs(OT_Sales_traints))*100

#Validation Forecast Mape
OT_Sales_forecast_AA <- forecast:::forecast.Arima(OT_Sales_autoArima,h=12)
OT_MAPE_test_AA <- mean(abs(OT_Sales_testts-OT_Sales_forecast_AA$mean)/abs(OT_Sales_testts))*100

Visualizing Forecasted Values

3) ETS model

OT_Sales_ETS <- ets(model = "ZMZ",OT_Sales_traints)

#Train Model MAPE
OT_MAPE_train_ETS <- mean(abs(OT_Sales_traints-OT_Sales_ETS$fitted)/abs(OT_Sales_traints))*100

#Validation Forecast Mape
OT_Sales_forecast_ETS <- forecast:::forecast.ets(OT_Sales_ETS,h=12)
OT_MAPE_test_ETS <- mean(abs(OT_Sales_testts-OT_Sales_forecast_ETS$mean)/abs(OT_Sales_testts))*100

Visualizing Forecasted Values

4) thetaf model

OT_Sales_taf <- thetaf(OT_Sales_traints)

#Train Model MAPE
OT_MAPE_train_taf <- mean(abs(OT_Sales_traints-OT_Sales_taf$fitted)/abs(OT_Sales_traints))*100

#Validation Forecast Mape
OT_Sales_forecast_taf <- forecast:::forecast.Arima(OT_Sales_autoArima,h=12)
OT_MAPE_test_taf <- mean(abs(OT_Sales_testts-OT_Sales_forecast_taf$mean)/abs(OT_Sales_testts))*100

Visualizing Forecasted Values

5) xGBoost model

OT_Sales_xg <- forecastxgb:::xgbts(OT_Sales_traints)
## Warning in forecastxgb:::xgbts(OT_Sales_traints): xgbts is deprecated terminology and will soon be removed.
## Please use xgbar instead.
#Train Model MAPE
OT_MAPE_train_xG <- mean(abs(OT_Sales_traints-Sales_xg$y)/abs(OT_Sales_traints))*100

#Validation Forecast Mape
OT_Sales_forecast_xG <- forecast:::forecast(OT_Sales_xg,h=12)
OT_MAPE_test_xG <- mean(abs(OT_Sales_testts-OT_Sales_forecast_xG$mean)/abs(OT_Sales_testts))*100

Visualizing Forecasted Values

Model Evaluation DataFrameProcessing

men <-list(c(MAPE_train_HW,MAPE_test_HW),
            c( MAPE_train_AA,MAPE_test_AA),
            c( MAPE_train_ETS,MAPE_test_ETS),
            c( MAPE_train_taf,MAPE_test_taf),
             c(MAPE_train_xG,MAPE_test_xG))
women <-list(c(FE_MAPE_train_HW,FE_MAPE_test_HW),
            c( FE_MAPE_train_AA,FE_MAPE_test_AA),
            c( FE_MAPE_train_ETS,FE_MAPE_test_ETS),
            c( FE_MAPE_train_taf,FE_MAPE_test_taf),
             c(FE_MAPE_train_xG,FE_MAPE_test_xG))
other <-list(c(OT_MAPE_train_HW,OT_MAPE_test_HW),
            c( OT_MAPE_train_AA,OT_MAPE_test_AA),
            c( OT_MAPE_train_ETS,OT_MAPE_test_ETS),
            c( OT_MAPE_train_taf,OT_MAPE_test_taf),
             c(OT_MAPE_train_xG,OT_MAPE_test_xG))
df<- as.data.frame(c(men,women,other))
#write.csv(df,"heello.csv")
df<-read.csv("heello.csv")

Model Evaluation

df
##    ProductCategory Model     Train Validation
## 1              Men    HW  3.208222   6.174506
## 2              Men    AA  0.000000   6.280798
## 3              Men   ETS  2.948638   6.866537
## 4              Men   TAF 11.849201   6.280798
## 5              Men    XG  0.000000   7.266726
## 6            Women    HW  5.296158  16.438342
## 7            Women    AA  0.000000  18.646848
## 8            Women   ETS  3.776894   5.927672
## 9            Women   TAF 11.838242  18.646848
## 10           Women    XG  0.000000  11.014408
## 11           Other    HW  3.754543  15.141701
## 12           Other    AA  0.000000   5.829530
## 13           Other   ETS  3.719118  13.365118
## 14           Other   TAF  7.940802   5.829530
## 15           Other    XG 39.063186   9.922108

Observation

Clearly xgBoost overfits the Training Data for both Men and Women but intrestingly models goes off for Others category. Going by validation result best model for
a)Men -> HoltWinter
b)Women -> ETS
c)Others -> AutoArima

Point To Ponder:

Whereas Ensemble of these 3 individual models gives poor result in Grader as compared to xGboost.